Complete LLM Development Roadmap
Building Your Own Large Language Model & AI Service (Like Claude, Gemini, ChatGPT)
Scope: This roadmap covers everything from foundational math to deploying a production-grade LLM service – structured learning paths, algorithms, architecture, hardware, reverse engineering, and cutting-edge developments.
1. Foundation Prerequisites
1.1 Programming Languages
- Python (Primary Language)
- OOP, functional programming, decorators, generators
- Async/await, multiprocessing, threading
- Memory management, profiling, optimization
- Type hints, dataclasses, abstract classes
- C/C++ (Performance-critical components)
- Pointers, memory allocation, RAII
- CUDA extensions, custom kernels
- CUDA (GPU Programming)
- Thread blocks, warps, shared memory
- Memory coalescing, kernel optimization
- Bash/Shell (DevOps, automation)
- SQL (Data management)
- Rust (Optional – emerging for inference engines)
1.2 Computer Science Fundamentals
- Data Structures: Arrays, Trees, Graphs, Hash Tables, Heaps
- Algorithms: Sorting, Searching, Dynamic Programming, Graph algorithms
- Complexity Analysis: Big-O notation, space/time tradeoffs
- Distributed Systems: CAP theorem, consensus algorithms, sharding
- Operating Systems: Process management, memory paging, I/O
- Computer Networks: TCP/IP, HTTP/2, gRPC, WebSockets
- Databases: Relational (PostgreSQL), NoSQL (MongoDB, Redis), Vector DBs
1.3 Software Engineering Practices
- Version Control: Git, GitHub, branching strategies
- Testing: Unit, integration, regression, load testing
- CI/CD: GitHub Actions, Jenkins, Docker, Kubernetes
- Design Patterns: Factory, Observer, Strategy, Pipeline
- API Design: REST, GraphQL, gRPC
- Containerization: Docker Compose, Kubernetes orchestration
2. Mathematics & Statistics Deep Dive
2.1 Linear Algebra (Most Critical)
- Vectors & Spaces
- Vector operations, dot products, cross products
- Vector spaces, basis, span, linear independence
- Subspaces, null space, column space
- Matrices
- Matrix multiplication, transpose, inverse
- Rank, determinant, trace
- Special matrices: diagonal, orthogonal, symmetric, positive definite
- Eigendecomposition
- Eigenvalues, eigenvectors, characteristic polynomial
- Diagonalization, spectral theorem
- Power iteration, QR algorithm
- Singular Value Decomposition (SVD)
- Full vs. truncated SVD
- Applications in dimensionality reduction, LoRA
- Relationship to PCA
- Tensor Operations
- Higher-order tensors, tensor contractions
- Einstein summation notation (einsum)
- Tensor decomposition (Tucker, CP)
- Norms & Distances
- L1, L2, Frobenius, nuclear norms
- Cosine similarity, KL divergence as distance
2.2 Calculus & Optimization
- Differential Calculus
- Derivatives, partial derivatives, directional derivatives
- Chain rule, product rule, quotient rule
- Jacobian matrix, Hessian matrix
- Taylor series expansion
- Integral Calculus
- Definite/indefinite integrals
- Fundamental theorem of calculus
- Numerical integration (quadrature)
- Multivariable Calculus
- Gradient, divergence, curl
- Lagrange multipliers, constrained optimization
- Vector fields and flow
- Optimization Theory
- Convex vs. non-convex optimization
- First and second-order optimality conditions
- Saddle points, local vs. global minima
- Lagrangian relaxation, KKT conditions
2.3 Probability & Statistics
- Probability Theory
- Probability spaces, sample spaces, events
- Conditional probability, Bayes' theorem
- Law of large numbers, central limit theorem
- Moment generating functions
- Probability Distributions
- Discrete: Bernoulli, Binomial, Poisson, Categorical
- Continuous: Gaussian, Uniform, Beta, Dirichlet, Laplace
- Multivariate distributions, covariance matrices
- Information Theory
- Entropy, cross-entropy, joint entropy
- Kullback-Leibler (KL) divergence
- Mutual information, Jensen-Shannon divergence
- Minimum description length
- Statistical Estimation
- Maximum likelihood estimation (MLE)
- Maximum a posteriori (MAP)
- Bayesian inference, prior/posterior
- Expectation-Maximization (EM) algorithm
- Sampling Methods
- Monte Carlo sampling
- Markov Chain Monte Carlo (MCMC)
- Importance sampling
- Temperature sampling, top-k, top-p (nucleus sampling)
2.4 Numerical Methods
- Floating point arithmetic, precision issues (fp16, bf16, fp32)
- Numerical stability, gradient clipping
- Fast Fourier Transform (FFT)
- Sparse matrix operations
- Iterative solvers (conjugate gradient)
3. Machine Learning Fundamentals
3.1 Core Concepts
- Supervised, Unsupervised, Semi-supervised, Self-supervised learning
- Bias-variance tradeoff, overfitting, underfitting
- Regularization: L1/L2, dropout, weight decay, early stopping
- Cross-validation, hyperparameter tuning
- Feature engineering, normalization, standardization
3.2 Classical Algorithms
- Linear Regression, Logistic Regression
- Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM)
- Support Vector Machines (SVM), kernel trick
- K-Nearest Neighbors (KNN)
- Naive Bayes, Gaussian Mixture Models
- PCA, t-SNE, UMAP (dimensionality reduction)
- K-Means, DBSCAN, Hierarchical clustering
3.3 Gradient Descent & Optimizers
- Vanilla Gradient Descent – full-batch, slow but stable
- Stochastic Gradient Descent (SGD) – noisy but generalizes well
- Mini-batch SGD – the industry-standard balance
- Momentum – exponential moving average of gradients
- Nesterov Momentum – look-ahead momentum update
- AdaGrad – per-parameter adaptive learning rate
- RMSProp – decaying average of squared gradients
- Adam – combines momentum + RMSProp (sketched in code after this list)
  - m_t = β1 · m_{t-1} + (1 - β1) · g_t
  - v_t = β2 · v_{t-1} + (1 - β2) · g_t²
  - θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε), where m̂_t, v̂_t are the bias-corrected m_t, v_t
- AdamW – Adam with decoupled weight decay (preferred for LLMs)
- Lion – EvoLved Sign Momentum (Google, 2023)
- Sophia – second-order optimizer for LLMs
- LAMB/LARS – large-batch distributed training optimizers
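A minimal PyTorch sketch of the Adam/AdamW update above (illustrative only; real training uses torch.optim.AdamW, and the defaults here are the LLM-typical hyperparameters quoted later in this roadmap):

import torch

def adamw_step(param, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    # First/second moment estimates (the m_t and v_t equations above)
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Bias correction for step t (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay (the "W" in AdamW), then the Adam update
    param.mul_(1 - lr * weight_decay)
    param.addcdiv_(m_hat, v_hat.sqrt() + eps, value=-lr)
    return param, m, v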
3.4 Loss Functions
- Mean Squared Error (MSE), Mean Absolute Error (MAE)
- Cross-Entropy Loss (language modeling: next-token prediction)
- Binary Cross-Entropy, Categorical Cross-Entropy
- Contrastive Loss, Triplet Loss (for embeddings)
- REINFORCE / Policy Gradient Loss (for RLHF)
4. Deep Learning Core
4.1 Neural Network Basics
- Perceptron, Multi-layer Perceptron (MLP)
- Activation functions:
  - ReLU: max(0, x) – dead neuron problem
  - GeLU: x · Φ(x) – smooth, used in GPT/BERT
  - SiLU/Swish: x · sigmoid(x) – Llama uses this
  - Mish, ELU, Leaky ReLU
  - Softmax: e^(x_i) / Σ_j e^(x_j) – for probability distributions
- Backpropagation algorithm, automatic differentiation
- Weight initialization: Xavier/Glorot, He initialization, normal/uniform
4.2 Normalization Techniques
- Batch Normalization – normalizes across the batch dimension
  - μ_B = (1/m) Σ x_i;  σ²_B = (1/m) Σ (x_i - μ_B)²
  - Problems with small batches and sequential data
- Layer Normalization – normalizes across the feature dimension
  - Used in all modern Transformers
  - LN(x) = (x - μ) / σ · γ + β
- RMS Normalization (RMSNorm) – simplified LayerNorm
  - RMSNorm(x) = x / RMS(x) · γ; no mean subtraction
  - Used in Llama, Mistral – more efficient
- Group Normalization – between BatchNorm and LayerNorm
- Pre-Norm vs. Post-Norm – Pre-Norm (normalize before the sublayer) is more stable for deep networks
4.3 Regularization Deep Dive
- Dropout – randomly zero out neurons during training
- DropPath/Stochastic Depth – drop entire residual paths
- Label Smoothing – soften hard labels to prevent overconfidence
- Weight Decay (L2) – penalize large weights
- Gradient Clipping – cap the gradient norm to prevent explosion
- Mixup / CutMix – data augmentation regularizers
4.4 CNN, RNN, LSTM (Pre-Transformer Context)
- Convolutional Neural Networks – local feature extraction
- Recurrent Neural Networks – sequential dependencies
  - Vanishing/exploding gradient problem
- Long Short-Term Memory (LSTM) – gating mechanisms
- Gated Recurrent Unit (GRU) – simplified LSTM
- Seq2Seq with attention – foundation of the Transformer
- Encoder-Decoder architecture – the original MT framework
5. Natural Language Processing (NLP)
5.1 Text Preprocessing
- Tokenization strategies:
- Word-level: simple but large vocabulary
- Character-level: small vocab but long sequences
- Subword: best of both worlds
  - Byte-Pair Encoding (BPE) – GPT-2, GPT-4 tokenizer
    - Start with characters, merge the most frequent pairs iteratively (see the merge sketch at the end of this subsection)
    - Creates a vocabulary of ~32k-100k tokens
  - WordPiece – BERT tokenizer, similar to BPE
  - SentencePiece – language-agnostic; Llama, T5
    - Unigram language model variant
  - Tiktoken – OpenAI's fast tokenizer library
- Stop words, stemming, lemmatization (less used with LLMs)
- Text normalization: lowercasing, Unicode handling
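To make the BPE merge loop concrete, here is a toy trainer (a sketch: `words` maps pre-split words to corpus counts; the byte fallback, word-boundary handling, and performance tricks of real tokenizers like tiktoken or HuggingFace tokenizers are omitted):

from collections import Counter

def bpe_train(words, num_merges):
    # Represent each word as a tuple of symbols (characters initially)
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair becomes a new token
        merges.append(best)
        new_vocab = {}
        for word, count in vocab.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges

# e.g. bpe_train({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)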
5.2 Word Embeddings (Pre-Transformer)
- Word2Vec (2013, Google)
- Skip-gram: predict context from center word
- CBOW: predict center from context
- Negative sampling optimization
- GloVe – Global Vectors, co-occurrence matrix factorization
- FastText – subword embeddings, handles OOV words
- ELMo – contextual embeddings from a bidirectional LSTM
- Semantic similarity, analogy tasks (king - man + woman = queen)
5.3 Classic NLP Tasks (Now handled end-to-end by LLMs)
- Named Entity Recognition (NER)
- Part-of-Speech (POS) tagging
- Sentiment Analysis
- Machine Translation
- Summarization
- Question Answering
- Text Classification
6. Transformer Architecture – The Heart of LLMs
6.1 Original Transformer ("Attention Is All You Need", 2017)
- Input Embedding – token IDs → dense vectors (dim d_model)
- Positional Encoding – adds position information, since there is no recurrence
  - Sinusoidal: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
  - Learnable positional embeddings (BERT, GPT)
- Encoder Stack – bidirectional, used for understanding
- Decoder Stack – autoregressive, used for generation
- Cross-Attention – decoder attends to encoder outputs
6.2 Attention Mechanism – Complete Breakdown
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
Where:
- Q = Query matrix (what we're looking for)
- K = Key matrix (what's available to match)
- V = Value matrix (what we actually retrieve)
- d_k = dimension of the keys (scaling factor)
- Self-Attention – Q, K, V all come from the same input
- Cross-Attention – Q from the decoder, K/V from the encoder
- Causal/Masked Attention – mask future tokens (GPT-style)
  - Lower-triangular mask: M_ij = -∞ if j > i, else 0
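The formula and causal mask above in a few lines of PyTorch (a single-head sketch; production code uses fused kernels such as FlashAttention):

import math
import torch

def causal_attention(Q, K, V):
    # Q, K: (batch, seq, d_k); V: (batch, seq, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    seq_len = scores.size(-1)
    # M_ij = -inf where j > i, so softmax gives zero weight to future tokens
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V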
6.3 Multi-Head Attention (MHA)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
head_i = Attention(Q*W_Q_i, K*W_K_i, V*W_V_i)
- Each head learns different aspects of relationships
- Typical: 8, 12, 16, 32, 64 heads
- Parallel computation, concatenate then project
6.4 Feed-Forward Network (FFN)
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
Or with GeLU:
FFN(x) = GeLU(xW_1 + b_1) * W_2 + b_2
SwiGLU variant (Llama):
FFN(x) = (SiLU(xW_1) * xW_3) * W_2
- d_ff ≈ 4 · d_model (hidden expansion)
- SwiGLU variants use d_ff ≈ (2/3) · 4 · d_model so the three matrices match the parameter count of the standard two-matrix FFN
6.5 Residual Connections & Layer Norm
- Residual Connection: x = x + Sublayer(LN(x))
- Pre-norm (before the sublayer) – better gradient flow
- Post-norm (after the sublayer) – original paper style
- Why residuals: prevent vanishing gradients in deep networks
6.6 Positional Encoding Evolution
- Absolute Positional Encoding – fixed sinusoidal (original)
- Learnable Absolute PE – BERT, GPT-2
- Relative Positional Encoding – Transformer-XL
  - Encode distances between tokens, not absolute positions
- ALiBi (Attention with Linear Biases) – linear penalty
  - Add bias -|i - j| · m to attention scores (m is a head-specific slope)
  - Better length generalization than sinusoidal
- RoPE (Rotary Position Embedding) – GPT-NeoX, Llama
  - Rotate Q and K vectors by an angle proportional to position
  - RoPE(x, pos) = x · e^(i·θ·pos) (a rotation in the complex plane)
  - Excellent length generalization, used in most SOTA models
- YaRN/LongRoPE – extend the context of RoPE models
- NoPE – no positional encoding; rely on attention patterns
6.7 Attention Variants & Optimizations
- Multi-Query Attention (MQA) – a single K and V shared across all heads
  - Reduces KV cache size by a factor of num_heads (PaLM, Falcon)
- Grouped Query Attention (GQA) – groups of heads share K, V
  - Balance between MHA quality and MQA efficiency (Llama 2/3)
- Sliding Window Attention – each token attends to a local window
  - Mistral uses a 4096-token sliding window
- Flash Attention – IO-aware exact attention algorithm
  - Tiles Q, K, V to fit in GPU SRAM
  - Never materializes the full N×N attention matrix
  - 2-4× speedup, O(N) memory instead of O(N²)
- Flash Attention 2 & 3 – further optimizations for H100
- PagedAttention – vLLM's memory-efficient KV cache paging
- Ring Attention – distributes attention across devices for ultra-long sequences
- Sparse Attention – attend to a subset of tokens (Longformer, BigBird)
- Linear Attention – approximate attention in O(N) time
7. Large Language Model Internals
7.1 Model Families & Architectures
- Encoder-Only (BERT family)
- Bidirectional context, MLM pre-training
- Best for: classification, NER, embeddings
- Examples: BERT, RoBERTa, DeBERTa, ELECTRA
- Decoder-Only (GPT family) – most modern LLMs
- Causal/autoregressive, CLM pre-training
- Best for: generation, chat, reasoning
- Examples: GPT-4, Claude, Llama, Mistral, Gemini
- Encoder-Decoder (T5/Seq2Seq family)
- Encoder reads input, decoder generates output
- Best for: translation, summarization with source
- Examples: T5, FLAN-T5, BART, mBART
7.2 Scaling Laws
- Chinchilla Scaling Laws (Hoffmann et al., 2022)
  - Optimal: training tokens D ≈ 20× the number of parameters N
  - N_optimal ≈ 1.69 × 10^9 × C^0.49 (compute C in FLOPs)
  - Training compute: C ≈ 6ND FLOPs (N params, D tokens)
  - GPT-3 was undertrained; Chinchilla: same compute, more tokens
- OpenAI Scaling Laws (Kaplan et al., 2020)
  - Loss scales as a power law with compute, data, and parameters
  - L(N) ∝ N^(-0.076); L(D) ∝ D^(-0.095)
- Emergent Abilities – appear suddenly at certain scales
  - In-context learning (~1B+), chain-of-thought (~100B+)
- Neural Scaling Laws for LLM Inference
  - Larger model + fewer inference steps can beat smaller model + more steps
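A back-of-the-envelope helper combining the two rules of thumb above, C ≈ 6ND and D ≈ 20N (order-of-magnitude guidance only):

import math

def chinchilla_optimal(flops_budget):
    # C = 6*N*D with D = 20*N  =>  C = 120*N^2
    n_params = math.sqrt(flops_budget / 120)
    return n_params, 20 * n_params

# chinchilla_optimal(1e23) -> roughly 29B parameters, 577B tokens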
7.3 Context Window & Memory
- Context window – maximum tokens the model can process
- KV Cache – cache the Key and Value tensors during generation
  - Memory: 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
  - For Llama-3-70B at 100K context: ~30GB just for the KV cache
- KV Cache Compression
  - StreamingLLM – sink tokens + recent window
  - SnapKV – select important KV pairs
  - MLA (Multi-head Latent Attention) – DeepSeek's innovation
- Positional interpolation – extend context beyond the training length
- Infini-attention – compressive memory for unbounded context
- Mamba/SSM – linear recurrence, O(1) memory per step
7.4 Tokenizer Design Details
- Vocabulary size tradeoffs
- Larger vocab: shorter sequences, faster, but larger embedding table
- Smaller vocab: longer sequences, slower, smaller model
- Typical: 32K (Llama 2), ~100K (GPT-4 cl100k), 128K (Llama 3), 256K (Gemini)
- Special tokens: [BOS], [EOS], [PAD], [UNK], [MASK]
- Chat templates: system/user/assistant turn formatting
- Llama: <|begin_of_text|><|start_header_id|>system<|end_header_id|>...
- ChatML: <|im_start|>system\n...<|im_end|>
7.5 Mixture of Experts (MoE)
- Replace dense FFN with N expert FFNs + router
- Top-K Routing: only K experts activated per token (K=1 or 2)
- Load Balancing Loss: encourage equal use of all experts
- Sparse MoE: Mixtral 8x7B, 8x22B – 8 experts, 2 active
- Fine-grained MoE: DeepSeek-V3 – 256 experts, 8 active
- Expert Choice routing – experts choose tokens (better load balance)
- Advantages: massive parameter count at the compute cost of a much smaller dense model
8. Training Pipeline – From Scratch to Advanced
8.1 Data Collection & Curation
Sources
- Common Crawl – petabyte-scale web crawl (CC-Main, CC-News)
- The Pile – EleutherAI's 825GB diverse dataset
- RedPajama – open reproduction of the LLaMA training data
- ROOTS – multilingual BLOOM training data
- Books: Project Gutenberg, Books3, BookCorpus
- Code: GitHub (The Stack, StarCoder data), code contests
- Scientific papers: arXiv, PubMed, S2ORC
- Wikipedia/Wikidata – high-quality factual text
- StackExchange, Reddit – Q&A, discussion
- Multilingual: CC-100, mC4, CulturaX
Data Processing Pipeline
Raw HTML/Text
  ↓
URL/Domain Filtering (dedup, quality domains)
  ↓
Language Identification (fastText, langdetect)
  ↓
Quality Filtering:
  - Perplexity filter (KenLM)
  - Heuristics (short docs, repetition ratio, symbol ratio)
  - ML classifiers (CCNet, Gopher quality filters)
  ↓
Deduplication:
  - Exact: MD5/SHA256 hashing
  - Near-duplicate: MinHash LSH, SimHash
  - Semantic: embedding-based dedup
  ↓
PII Removal (emails, phone numbers, SSNs)
  ↓
Tokenization & Packing
  ↓
Binary format (numpy memmap, HDF5, WebDataset)
Data Mixing & Weighting
- Domain weighting: upweight high-quality sources
- Data mixing ratios (e.g., 80% web, 10% code, 5% books, 5% science)
- Data flywheels: use the trained model to filter better data
- DSIR (Data Selection via Importance Resampling) – target-aware selection
- DoReMi – automatic domain weight optimization
8.2 Model Architecture Configuration
Hyperparameter Selection Table
| Model Size | d_model | n_layers | n_heads | d_ff | Params |
|---|---|---|---|---|---|
| 125M | 768 | 12 | 12 | 3072 | ~125M |
| 1.3B | 2048 | 24 | 16 | 8192 | ~1.3B |
| 7B | 4096 | 32 | 32 | 11008 | ~7B |
| 13B | 5120 | 40 | 40 | 13824 | ~13B |
| 30B | 6656 | 60 | 52 | 17920 | ~30B |
| 70B | 8192 | 80 | 64 | 28672 | ~70B |
| 175B (GPT-3) | 12288 | 96 | 96 | 49152 | ~175B |
8.3 Pre-Training
Causal Language Modeling Objective
L_CLM = -Σ_t log P(x_t | x_1, ..., x_{t-1})
For each sequence: predict next token given all previous tokens
Cross-entropy loss averaged over all positions
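In code, this is a shift-by-one cross-entropy (a sketch): position t predicts token t+1, and the loss is averaged over all positions:

import torch.nn.functional as F

def clm_loss(logits, tokens):
    # logits: (batch, seq, vocab); tokens: (batch, seq)
    pred = logits[:, :-1, :]     # predictions at positions 1..T-1
    target = tokens[:, 1:]       # each position's next token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))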
Training Configuration
- Batch size: typically 256-4096 sequences
- Sequence length: 2048-8192 tokens per sequence
- Global batch size: micro_batch × grad_accum × world_size
- Learning rate schedule:
  - Linear warmup (1000-2000 steps)
  - Cosine decay to lr_min = 0.1 × lr_max
  - lr_max typically 1e-4 to 3e-4
- Weight decay: 0.1 (AdamW standard)
- Gradient clipping: clip at norm 1.0
- β1 = 0.9, β2 = 0.95, ε = 1e-8 (Adam hyperparameters for LLMs)
Training Stability Techniques
- Loss spikes: reduce lr, check data quality at spike step
- Gradient norm monitoring: track throughout training
- Loss divergence recovery: reload checkpoint, skip data batch
- Z-loss regularization: penalize large logit magnitudes
- QK Norm: normalize Q and K before attention score computation
- Checkpoint averaging: average last N checkpoints for stability
8.4 Distributed Training
Data Parallelism (DP)
- Replicate model on each GPU
- Each GPU processes different batch
- Synchronize gradients via AllReduce after backward
- DDP (PyTorch), Horovod
- FSDP (Fully Sharded Data Parallel) – PyTorch's equivalent of ZeRO Stage 3
ZeRO (Zero Redundancy Optimizer) – DeepSpeed
Stage 0: Baseline DDP (model replicated)
Stage 1: Shard optimizer states across GPUs
Stage 2: Shard optimizer states + gradients
Stage 3: Shard optimizer states + gradients + parameters
         (full model sharding – needed for 70B+ on 8 GPUs)
ZeRO-Infinity: offload to CPU/NVMe for extreme scale
Tensor Parallelism (TP) – Megatron-LM
- Split individual weight matrices across GPUs
- Column parallel: split W along output dimension
- Row parallel: split W along input dimension
- Requires AllReduce at each forward/backward
- Best for very large layers (d_model = 8192+)
Pipeline Parallelism (PP)
- Assign layers to different GPUs/nodes
- GPipe: micro-batches flow through pipeline
- PipeDream: 1F1B (one forward, one backward) schedule
- Bubble overhead: (p-1)/(m+p-1) for p stages, m micro-batches
- Interleaved pipeline: reduces bubble, increases memory
Sequence Parallelism (SP)
- Distribute long sequence across devices
- Each device handles chunk of sequence length
- Ring Attention: pass KV around ring of devices
- Useful for 100K+ context training
3D Parallelism (Megatron-DeepSpeed)
- Combine DP + TP + PP for training 100B+ models
- Example: 175B on 1024 GPUs: DP=8, TP=8, PP=16
8.5 Mixed Precision Training
- FP32 – full precision, safe but 2× the memory of FP16
- FP16 – 5-bit exponent, 10-bit mantissa, can overflow
- BF16 – 8-bit exponent, 7-bit mantissa, same range as FP32
  - Preferred for LLM training (no loss scaling needed)
- AMP (Automatic Mixed Precision):
  - Keep master weights in FP32
  - Forward/backward in FP16/BF16
  - Update the master FP32 weights
- FP8 Training – native on H100, needs careful scaling
  - Transformer Engine (NVIDIA) handles FP8 automatically
8.6 Checkpointing & Recovery
- Save every N steps (N = 500-2000 typically)
- A checkpoint includes: model weights, optimizer states, scheduler state, RNG state
- Activation Checkpointing – recompute activations during the backward pass to save memory
  - Trade ~33% extra compute for ~10× memory savings
- Selective Activation Checkpointing – checkpoint only expensive ops
- Distributed checkpoint sharding (each rank saves its own shard)
9. RLHF, Alignment & Fine-Tuning
9.1 Supervised Fine-Tuning (SFT)
- Collect instruction-response pairs (hundreds of thousands)
- Data formats:
- Alpaca format: instruction/input/output
- ShareGPT format: multi-turn conversations
- FLAN/T0: task-specific instruction templates
- Fine-tune with teacher forcing on completions only
- Mask loss on prompt tokens, compute only on response
- Key datasets: OpenAssistant, Dolly, FLAN, WizardLM, UltraChat
9.2 RLHF Pipeline (Reinforcement Learning from Human Feedback)
Step 1: SFT Model (instruction-following base)
  ↓
Step 2: Preference Data Collection
  Human annotators compare 2+ model outputs
  Rank: A > B or A = B
  Collect ~50K-1M comparisons
  ↓
Step 3: Reward Model Training
  Bradley-Terry model:
  L_RM = -E[log σ(r(x, y_w) - r(x, y_l))]
  where y_w = preferred response, y_l = rejected
  ↓
Step 4: PPO Training
  Maximize: E[r(x, y)] - β · KL(π_θ || π_ref)
  The KL penalty prevents the model from drifting too far from the SFT policy
PPO (Proximal Policy Optimization) Details
L_CLIP = E[min(r_t(θ) · A_t, clip(r_t(θ), 1-ε, 1+ε) · A_t)]
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)  (probability ratio)
A_t = advantage estimate (GAE: Generalized Advantage Estimation)
ε = 0.2 (clipping parameter)
Value function:
L_VF = E[(V_θ(s_t) - V_target)²]
Total loss:
L = L_CLIP - c1 · L_VF + c2 · S[π_θ](s_t)  (S = entropy bonus)
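The clipped policy objective as a per-token PyTorch sketch (log-probabilities and advantages are assumed to come from the rollout and GAE machinery):

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # r_t(θ) = π_θ / π_θ_old, computed in log space for stability
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantages
    # We maximize the objective, so we minimize its negative
    return -torch.min(unclipped, clipped).mean()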
9.3 DPO (Direct Preference Optimization) – Simpler RLHF
L_DPO = -E[log σ(β · (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))]
Advantages over RLHF-PPO:
- No separate reward model needed
- More stable training
- Simpler implementation
- Comparable or better results
- Variants: IPO, KTO, ORPO, SimPO, CPO
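Much of DPO's appeal is that the loss above is only a few lines (a sketch; inputs are the summed log-probabilities of each full response under the policy and under the frozen reference model):

import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin between chosen (w) and rejected (l) responses
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()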
9.4 Parameter-Efficient Fine-Tuning (PEFT)
LoRA (Low-Rank Adaptation)
W = W_0 + ΔW = W_0 + BA
where B ∈ R^(d×r), A ∈ R^(r×k), r ≪ min(d, k)
Typically r = 4, 8, 16, 32, 64
Number of trainable params: r·(d+k) vs. d·k
Reduction ratio: r·(d+k) / (d·k), i.e. 2r/d for a square d×d layer
Example: a 4096×4096 layer with r=16 trains ~0.8% of the original parameters (~99.2% reduction)
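A minimal LoRA wrapper around an existing nn.Linear, mirroring W = W_0 + BA (a sketch; in practice libraries like HuggingFace PEFT handle this, and the alpha/r scaling follows the original paper's convention):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # W_0 stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale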
- QLoRA – quantize the base model to 4-bit, train LoRA adapters on top
  - NF4 quantization (Normal Float 4-bit)
  - Double quantization: quantize the quantization constants too
  - Paged Optimizers: offload optimizer states to CPU RAM
- LoRA+ – different learning rates for the A and B matrices
- DoRA – decompose into magnitude + direction components
- LoRA-FA – frozen A matrix, only train B
- AdaLoRA – adaptive rank allocation per layer
- PiSSA – principal singular values and singular vectors
- Prefix Tuning – trainable prefix tokens prepended at each layer
- P-Tuning v2 – deep prompt tuning
- IA³ – rescale activations with learned vectors (<0.1% params)
9.5 Constitutional AI (Anthropic's Claude Approach)
- Define principles/constitution for model behavior
- CAI-SL: fine-tune on critiques and revisions following constitution
- CAI-RL: use AI feedback instead of human (RLAIF)
- Generate responses → evaluate against the constitution → rank → RL
- Red-teaming: adversarial probing for harmful outputs
- Harmlessness + Helpfulness + Honesty triad
9.6 Model Merging
- SLERP – spherical linear interpolation of model weights
- TIES-Merging – trim + elect sign + disjoint merge
- DARE – random pruning before merging (sparse delta weights)
- Model Soup – average fine-tuned models (Wortsman et al.)
- Mergekit library for practical model merging
10. Major Algorithms & Techniques Reference
10.1 Generation Algorithms
- Greedy Decoding – always pick the highest-probability token
- Beam Search – maintain the top-B hypotheses at each step
- Sampling – sample from the probability distribution
- Temperature Scaling – T < 1 sharpens, T > 1 softens (a combined sampling sketch follows this list)
  - P'(x) = softmax(logits / T)
- Top-K Sampling – sample from the top K tokens only
- Top-P (Nucleus) Sampling – sample from the smallest token set whose cumulative probability reaches P
- Min-P Sampling – minimum probability threshold relative to the top token
- Typical Sampling – sample tokens of typical information content
- Contrastive Search – maximize (1-α)·p(x) - α·max_j cos_sim(h_x, h_xj)
- Speculative Decoding – a small draft model proposes, the large model verifies
  - ~2-4× speedup with no quality loss
- Medusa – parallel draft heads on a single model
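A combined sketch of temperature, top-k, and top-p from the list above (illustrative; real inference engines fuse and vectorize these filters):

import torch

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9):
    # logits: (batch, vocab) for the next position
    logits = logits / max(temperature, 1e-5)
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:
        sorted_probs, idx = torch.sort(probs, descending=True)
        cum = torch.cumsum(sorted_probs, dim=-1)
        # Zero out tokens outside the smallest set with cumulative mass >= top_p
        sorted_probs[cum - sorted_probs > top_p] = 0.0
        sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
        choice = torch.multinomial(sorted_probs, num_samples=1)
        return idx.gather(-1, choice)
    return torch.multinomial(probs, num_samples=1)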
10.2 Inference Optimization
- KV Cache – store past key/value tensors so per-token attention cost drops from O(N²) to O(N)
- Continuous Batching – dynamic batching; no waiting for sequences to finish
- PagedAttention – virtual memory for the KV cache
- Quantization:
  - PTQ (Post-Training Quantization): GPTQ, AWQ, SmoothQuant
  - QAT (Quantization-Aware Training): FP8, INT8 training
  - W4A16: 4-bit weights, 16-bit activations (most common)
  - GGUF format: llama.cpp quantization (Q4_K_M, Q5_K_S, etc.)
- Weight Sharing/Tying – tie the embedding and output projection matrices
- Knowledge Distillation – a small student learns from a large teacher
  - Response distillation: match output distributions
  - Feature distillation: match intermediate representations
- Pruning:
- Magnitude pruning: remove small-weight connections
- Structured pruning: remove entire heads/layers
- SparseGPT: one-shot unstructured pruning for GPT models
10.3 Long Context Techniques
- Sliding Window Attention – Mistral's local attention
- Longformer – local + global attention tokens
- BigBird – random + local + global attention
- ALiBi – linear bias enables zero-shot length generalization
- RoPE scaling variants:
  - Linear interpolation (position_id / scale_factor)
  - NTK-aware interpolation
  - YaRN (Yet Another RoPE Extension)
  - LongRoPE (progressive rescaling)
- Retrieval Augmented Generation (RAG):
  - Dense retrieval (DPR, E5, BGE embeddings)
  - Sparse retrieval (BM25)
  - Hybrid retrieval
  - Reranking (ColBERT, cross-encoder)
10.4 Reasoning & Chain of Thought
- Chain-of-Thought (CoT) prompting – "Let's think step by step"
- Self-Consistency – sample multiple CoT paths, majority vote
- Tree of Thought (ToT) – tree search over reasoning steps
- Graph of Thought – arbitrary DAG of reasoning
- Program-Aided Language Models (PAL) – generate executable code
- ReAct – interleaved reasoning and action (tool use)
- Process Reward Models (PRM) – reward each reasoning step
- Outcome Reward Models (ORM) – reward the final answer only
- MCTS for LLM reasoning – Monte Carlo Tree Search guided by a PRM
- o1/o3-style reasoning – long chain-of-thought with test-time compute scaling
11. Tools, Frameworks & Libraries
11.1 Deep Learning Frameworks
| Framework | Use Case | Key Feature |
|---|---|---|
| PyTorch | Research & production | Dynamic graphs, pythonic |
| JAX | Google TPU training | XLA compilation, functional |
| TensorFlow | Production deployment | TF Serving, TFLite |
| MXNet | AWS ecosystem | Gluon API |
| PaddlePaddle | Baidu ecosystem | Chinese NLP focus |
11.2 LLM Training Frameworks
| Framework | Organization | Best For |
|---|---|---|
| Megatron-LM | NVIDIA | Large-scale 3D parallel training |
| DeepSpeed | Microsoft | ZeRO optimization, ZeRO-Infinity |
| FSDP | Meta/PyTorch | Simpler full sharding |
| Colossal-AI | HPC-AI Tech | Heterogeneous training |
| Alpa | UCB/Google | Auto-parallelism |
| LLaMA-Factory | Community | Fine-tuning factory |
| Axolotl | OpenAccess | YAML-configured fine-tuning |
| TRL | HuggingFace | RLHF/DPO training |
| OpenRLHF | OpenLLMAI | Scalable RLHF |
| Nanotron | HuggingFace | Lightweight pre-training |
11.3 Inference Frameworks
| Framework | Focus | Key Feature |
|---|---|---|
| vLLM | High throughput | PagedAttention, continuous batching |
| TGI (Text Generation Inference) | HuggingFace | Production API |
| llama.cpp | Local/edge | CPU inference, GGUF |
| Ollama | Local deployment | Easy model management |
| TensorRT-LLM | NVIDIA GPU | TensorRT optimized kernels |
| MLC-LLM | Multi-platform | Web, mobile, server |
| ExLlamaV2 | Consumer GPU | GPTQ inference |
| CTransformers | Python bindings | llama.cpp Python |
| LightLLM | Triton kernels | FlashAttention2 |
| SGLang | Structured generation | RadixAttention |
11.4 Model Hub & Ecosystem
- HuggingFace Hub – 500K+ models, datasets, spaces
  - Transformers library: universal model API
  - Datasets library: 50K+ datasets
  - PEFT library: LoRA, prefix tuning, etc.
  - Accelerate: multi-GPU/TPU training
  - Tokenizers: fast Rust tokenizers
- PyTorch Hub – model repository
- Weights & Biases (WandB) – experiment tracking
- MLflow – experiment tracking + model registry
- DVC – data version control
- LangChain – LLM application framework
- LlamaIndex – RAG and data indexing
- Haystack – NLP pipeline framework
11.5 Data Processing
- Apache Spark – distributed data processing
- Ray – distributed Python, Ray Data
- DataTrove – HuggingFace data processing pipeline
- The Stack deduplication – MinHash LSH at scale
- SentenceTransformers – embedding models
- FAISS – fast ANN search over vectors
- Elasticsearch – BM25 + vector search
11.6 Evaluation Frameworks
| Benchmark | Tests | Size |
|---|---|---|
| MMLU | World knowledge, 57 subjects | 14K questions |
| HellaSwag | Commonsense reasoning | 70K examples |
| HumanEval | Code generation | 164 problems |
| MBPP | Python programming | 500 problems |
| GSM8K | Grade school math | 8.5K problems |
| MATH | Competition math | 12.5K problems |
| ARC-Challenge | Science QA | 1.2K questions |
| TruthfulQA | Factual accuracy | 817 questions |
| BIG-Bench | Diverse reasoning | 204 tasks |
| MT-Bench | Chat multi-turn | 80 questions |
| Chatbot Arena | Human preference | 1M+ votes |
| lm-evaluation-harness | EleutherAI eval framework | All benchmarks |
12. Hardware Requirements by Model Type
12.1 GPU Reference Table
Consumer GPUs
| GPU | VRAM | Memory BW | TFLOPs (BF16) | Best Use |
|---|---|---|---|---|
| RTX 3080 | 10GB | 760 GB/s | 29.8 | Inference ≤7B |
| RTX 3090 | 24GB | 936 GB/s | 35.6 | Inference ≤13B |
| RTX 4080 | 16GB | 736 GB/s | 48.7 | Inference ≤13B |
| RTX 4090 | 24GB | 1008 GB/s | 82.6 | Inference/fine-tune ≤33B |
| RTX 5090 | 32GB | 1792 GB/s | 209 | Inference/fine-tune ≤70B |
Data Center GPUs
| GPU | VRAM | Memory BW | TFLOPs (BF16) | NVLink | Best Use |
|---|---|---|---|---|---|
| A100 40GB | 40GB | 1.6 TB/s | 312 | Yes | Training ≤13B |
| A100 80GB | 80GB | 2.0 TB/s | 312 | Yes | Training ≤30B |
| H100 SXM | 80GB | 3.35 TB/s | 989 | Yes | Training ≤70B |
| H100 NVL | 94GB | 3.9 TB/s | 1979 | Yes | Large models |
| H200 | 141GB | 4.8 TB/s | 1979 | Yes | 70B+ training |
| B200 | 192GB | 8 TB/s | ~4500 | Yes | Frontier models |
Multi-GPU Requirements for Training
| Model Size | FP16/BF16 Memory | 8×A100 40GB | 8×A100 80GB | 8×H100 |
|---|---|---|---|---|
| 7B | ~14GB (weights) | Yes | Yes | Yes |
| 13B | ~26GB | With ZeRO3 | Yes | Yes |
| 30B | ~60GB | With ZeRO3 | With ZeRO3 | Yes |
| 70B | ~140GB | No | With ZeRO3 | Yes |
| 175B | ~350GB | No | 4 nodes | 2 nodes |
| 405B | ~810GB | No | No | 4+ nodes |
12.2 Memory Math
Model Parameters Memory (bytes):
- FP32: 4 bytes/param
- BF16/FP16: 2 bytes/param
- INT8: 1 byte/param
- INT4/NF4: 0.5 bytes/param
Training Memory = model + gradients + optimizer states
- SGD: model + gradients = 2× model
- Adam/AdamW: model + gradients + 2× optimizer state = 4× model
- AMP: model (fp16) + master model (fp32) + gradients (fp16) + optimizer (fp32 m, v)
  = 2 + 4 + 2 + 8 = 16 bytes/param
Inference Memory = model + KV cache + activations
KV cache = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_element
Example: 7B model training with AdamW under mixed precision:
7B × 16 bytes = 112GB → needs 2× A100 80GB with ZeRO
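The byte-count rules above as small helpers (rule-of-thumb only; activations, fragmentation, and framework overhead come on top):

def training_memory_gb(n_params, regime="adamw_amp"):
    bytes_per_param = {
        "sgd": 8,          # model (4) + gradients (4), all FP32
        "adamw_fp32": 16,  # model (4) + gradients (4) + m, v (8)
        "adamw_amp": 16,   # fp16 model (2) + fp32 master (4) + grads (2) + m, v (8)
    }[regime]
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    # The leading 2 accounts for storing both K and V
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el / 1e9

# training_memory_gb(7e9) -> 112.0; kv_cache_gb(80, 8, 128, 100_000) -> ~33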
12.3 TPU (Google)
- TPU v4: 275 TFLOPS BF16, 32GB HBM, 600 GB/s
- TPU v5e: purpose-built for inference, ~4× efficiency vs. v4
- TPU v5p: training powerhouse, 459 TFLOPS
- Cloud TPU Pods: up to 4096 chips interconnected (exaFLOP scale)
- Native JAX/XLA support; PyTorch via torch_xla
12.4 Infrastructure
- Networking: InfiniBand (400 Gb/s) between nodes, NVLink within node
- Storage: Lustre parallel filesystem, AWS FSx, GCS
- Read bandwidth: 1-100 GB/s for efficient data loading
- CPU: AMD EPYC/Intel Xeon for data preprocessing
- RAM: 512GB-2TB per node for large batches and ZeRO-Infinity offload
- NVMe: 30+ TB of fast local storage for checkpoints/cache
- Power: H100 SXM: 700W; a full node (8× H100): ~10kW
13. Architecture Designs – Working Principles
13.1 Complete Transformer Forward Pass (Decoder-Only)
INPUT: Token sequence [t_1, t_2, ..., t_n]
STEP 1: Embedding
x = Embedding(tokens) + PositionalEncoding(positions)
x ∈ R^(n × d_model)
STEP 2: For each of L transformer layers:
a) Layer Norm (Pre-Norm)
x_norm = LN(x) or RMSNorm(x)
b) Causal Self-Attention
Q = x_norm @ W_Q K = x_norm @ W_K V = x_norm @ W_V
Apply RoPE to Q and K:
Q, K = apply_rotary_embedding(Q, K, positions)
Split into h attention heads
For each head i:
A_i = softmax((Q_i @ K_i^T) / √d_k + causal_mask) @ V_i
Concatenate: A = [A_1; A_2; ...; A_h]
Output: Attn_out = A @ W_O
Residual: x = x + Attn_out
c) Layer Norm (Pre-Norm again)
x_norm2 = LN(x) or RMSNorm(x)
d) Feed-Forward Network (SwiGLU)
gate = SiLU(x_norm2 @ W_gate)
up = x_norm2 @ W_up
FFN = (gate * up) @ W_down
Residual: x = x + FFN
STEP 3: Final Layer Norm
x = RMSNorm(x)
STEP 4: Language Model Head (Linear projection + Softmax)
logits = x @ W_lm_head (or use tied embedding weights)
probs = softmax(logits / temperature)
STEP 5: Sample/select next token
next_token = sample(probs) or argmax(probs)
13.2 GPT Architecture (Decoder-Only)
- Unidirectional attention (causal mask)
- Predict next token: P(x_t | x_1...x_{t-1})
- GPT-1: 12 layers, 768 d_model, 117M params
- GPT-2: 48 layers, 1600 d_model, 1.5B params
- GPT-3: 96 layers, 12288 d_model, 175B params
- GPT-4: MoE, ~8×220B, estimated ~1.8T params (unconfirmed)
- Training: Causal LM, web text
- No official architecture paper for GPT-4
13.3 Llama Architecture Details
- Llama 1/2: RoPE, RMSNorm, SwiGLU FFN, GQA (Llama 2)
- Llama 3: 128K vocab, GQA, 128K context
- Architecture difference from GPT:
- No biases in linear layers
- RMSNorm instead of LayerNorm (no mean subtraction)
- RoPE instead of absolute PE
- SwiGLU with 3 matrices (up, gate, down) instead of 2
- GQA: fewer KV heads than query heads
13.4 BERT Architecture (Encoder-Only)
- Bidirectional attention (all tokens attend to all)
- Pre-training objectives:
- MLM: mask 15% tokens, predict them
- NSP: predict if sentence B follows sentence A
- Fine-tune on downstream tasks with task head
- CLS token embedding → classification
13.5 T5 Architecture (Encoder-Decoder)
- Encoder: bidirectional full attention
- Decoder: causal self-attention + cross-attention to encoder
- All tasks as text-to-text: "Translate: ..." → "..."
- Relative positional biases instead of absolute PE
13.6 Mamba / SSM Architecture (Alternative to Transformer)
State Space Model core:
  h'(t) = A·h(t) + B·x(t)
  y(t) = C·h(t) + D·x(t)
Discretized:
  h_t = Ā·h_{t-1} + B̄·x_t
  y_t = C·h_t
Mamba adds a selective scan mechanism (S4 + selectivity):
  Δ, B, C are input-dependent (unlike fixed SSMs)
Advantages:
- Linear O(N) training time
- O(1) memory per inference step
- Competitive with Transformers on long sequences
- Mamba 2 – parallel scan, SSD (Structured State Space Duality)
- Jamba – interleaved Mamba + Transformer layers (AI21)
- Falcon Mamba – a pure SSM language model
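The discretized recurrence above as a naive sequential scan for a single 1-D channel (a sketch; Mamba makes Δ, B, C input-dependent and replaces this Python loop with a hardware-efficient parallel scan):

import torch

def ssm_scan(A_bar, B_bar, C, x):
    # A_bar, B_bar, C: (d_state,) for one channel; x: (seq_len,)
    h = torch.zeros_like(A_bar)
    ys = []
    for x_t in x:                       # O(seq_len) time, O(1) state
        h = A_bar * h + B_bar * x_t     # h_t = Ā·h_{t-1} + B̄·x_t
        ys.append((C * h).sum())        # y_t = C·h_t
    return torch.stack(ys)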
13.7 Mixture of Experts (MoE) Architecture
Expert Router:
  g(x) = Softmax(TopK(x @ W_router))
Each token routes to K experts:
  output = Σ_k g_k(x) · Expert_k(x)
Load balancing:
  L_aux = α · Σ_i f_i · P_i
  f_i = fraction of tokens routed to expert i
  P_i = fraction of router probability assigned to expert i
- Mixtral 8×7B: 8 FFN experts, 2 active; effectively 12.9B active params
- DeepSeek-V3: 256 experts, 8 active + 1 shared expert that is always active
- Switch Transformer: top-1 routing, simpler but less expressive
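A minimal top-K router matching the equations above (a sketch; `experts` is a list of FFN callables, and production MoE layers add the auxiliary load-balancing loss and expert capacity limits):

import torch
import torch.nn.functional as F

def moe_forward(x, router_w, experts, k=2):
    # x: (tokens, d_model); router_w: (d_model, n_experts)
    logits = x @ router_w
    weights, idx = torch.topk(logits, k, dim=-1)   # each token picks its top-k experts
    weights = F.softmax(weights, dim=-1)           # renormalize over the chosen k
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel():
            out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
    return out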
14. Complete Design & Development Process
14.1 Phase 0: Problem Definition & Scoping (Weeks 1-2)
Decision Framework:
├── Model Purpose
│   ├── General assistant (broad capability)
│   ├── Domain specialist (legal, medical, code)
│   ├── Multilingual (cover which languages?)
│   └── Multimodal (text + image + audio?)
├── Scale Decision
│   ├── <1B: edge deployment, specialized tasks
│   ├── 1-7B: consumer hardware, good balance
│   ├── 7-70B: server deployment, high capability
│   └── 70B+: frontier capability, data center
├── Compute Budget
│   ├── GPU-hours × cost → max tokens you can train on
│   ├── Use the Chinchilla formula for optimal allocation
│   └── Factor in inference cost at scale
└── Success Metrics
    ├── Benchmark targets (MMLU, HumanEval, etc.)
    ├── Latency requirements (tokens/sec)
    └── Cost per token at serving scale
14.2 Phase 1: Data Pipeline (Months 1-2)
Step 1: Data Acquisition
- Download Common Crawl (use cc-net or datatrove)
- Acquire books, code, scientific papers
- License check everything
Step 2: Setup Processing Infrastructure
pip install datatrove apache-beam
# Spark cluster or Ray cluster for scale
Step 3: Implement Quality Pipeline
quality_pipeline = [
URLFilter(block_list=ADULT_DOMAINS),
LanguageFilter(languages=["en"], min_prob=0.65),
GopherQualityFilter(min_words=50, max_ratio_bullet_lines=0.9),
C4QualityFilter(),
ParagraphFilter(min_paragraphs=3),
]
Step 4: Deduplication
minhash_dedup = MinHashDedup(
n_shingles=5,
n_buckets=14,
n_hashes_per_bucket=8,
threshold=0.7
)
Step 5: Tokenize & Pack
# BPE tokenizer training (HuggingFace tokenizers API: training goes through a trainer object)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
trainer = BpeTrainer(vocab_size=32000, special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train(files=corpus_files, trainer=trainer)
# Pack sequences to max_length, add BOS/EOS
# Use numpy memmap for efficient storage
14.3 Phase 2: Model Implementation (Month 2-3)
# Complete minimal Llama-style transformer implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
@dataclass
class ModelConfig:
vocab_size: int = 32000
d_model: int = 4096
n_layers: int = 32
n_heads: int = 32
n_kv_heads: int = 8 # GQA
max_seq_len: int = 4096
ffn_dim: int = 14336
rms_norm_eps: float = 1e-5
class RMSNorm(nn.Module):
def __init__(self, dim, eps=1e-5):
super().__init__()
self.eps = eps
self.weight = nn.Parameter(torch.ones(dim))
def forward(self, x):
norm = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
return x * norm * self.weight
def precompute_freqs(dim, max_len, theta=10000.0):
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
t = torch.arange(max_len)
freqs = torch.outer(t, freqs)
freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
return freqs_cis
def apply_rotary_emb(xq, xk, freqs_cis):
xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
freqs_cis = freqs_cis[:xq.shape[1]].unsqueeze(0).unsqueeze(2)
xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
return xq_out.type_as(xq), xk_out.type_as(xk)
class Attention(nn.Module):
def __init__(self, config):
super().__init__()
self.n_heads = config.n_heads
self.n_kv_heads = config.n_kv_heads
self.head_dim = config.d_model // config.n_heads
self.n_rep = self.n_heads // self.n_kv_heads
self.wq = nn.Linear(config.d_model, config.n_heads * self.head_dim, bias=False)
self.wk = nn.Linear(config.d_model, config.n_kv_heads * self.head_dim, bias=False)
self.wv = nn.Linear(config.d_model, config.n_kv_heads * self.head_dim, bias=False)
self.wo = nn.Linear(config.n_heads * self.head_dim, config.d_model, bias=False)
def forward(self, x, freqs_cis, mask=None):
B, T, _ = x.shape
xq = self.wq(x).view(B, T, self.n_heads, self.head_dim)
xk = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim)
xv = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim)
xq, xk = apply_rotary_emb(xq, xk, freqs_cis)
# Expand KV for GQA
xk = xk.repeat_interleave(self.n_rep, dim=2)
xv = xv.repeat_interleave(self.n_rep, dim=2)
        # Flash Attention via scaled_dot_product_attention
        xq = xq.transpose(1, 2)
        xk = xk.transpose(1, 2)
        xv = xv.transpose(1, 2)
        # SDPA rejects attn_mask together with is_causal=True, so only
        # request the causal path when no explicit mask is supplied
        out = F.scaled_dot_product_attention(xq, xk, xv,
                                             attn_mask=mask,
                                             is_causal=mask is None)
out = out.transpose(1, 2).contiguous().view(B, T, -1)
return self.wo(out)
class SwiGLU(nn.Module):
def __init__(self, config):
super().__init__()
self.w1 = nn.Linear(config.d_model, config.ffn_dim, bias=False)
self.w2 = nn.Linear(config.ffn_dim, config.d_model, bias=False)
self.w3 = nn.Linear(config.d_model, config.ffn_dim, bias=False)
def forward(self, x):
return self.w2(F.silu(self.w1(x)) * self.w3(x))
class TransformerBlock(nn.Module):
def __init__(self, config):
super().__init__()
self.attention = Attention(config)
self.feed_forward = SwiGLU(config)
self.attention_norm = RMSNorm(config.d_model, config.rms_norm_eps)
self.ffn_norm = RMSNorm(config.d_model, config.rms_norm_eps)
def forward(self, x, freqs_cis, mask=None):
x = x + self.attention(self.attention_norm(x), freqs_cis, mask)
x = x + self.feed_forward(self.ffn_norm(x))
return x
class Transformer(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.embeddings = nn.Embedding(config.vocab_size, config.d_model)
self.layers = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layers)])
self.norm = RMSNorm(config.d_model, config.rms_norm_eps)
self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
# Tie weights
self.lm_head.weight = self.embeddings.weight
# Precompute RoPE frequencies
self.freqs_cis = precompute_freqs(config.d_model // config.n_heads, config.max_seq_len)
def forward(self, tokens, targets=None):
B, T = tokens.shape
x = self.embeddings(tokens)
freqs_cis = self.freqs_cis[:T].to(x.device)
for layer in self.layers:
x = layer(x, freqs_cis)
x = self.norm(x)
logits = self.lm_head(x)
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
return logits, loss
14.4 Phase 3: Training Infrastructure (Month 3-4)
# Training loop with FSDP + gradient accumulation
import functools
import math
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
def setup_training(config, model, train_dataset):
# FSDP wrapping
auto_wrap_policy = functools.partial(
transformer_auto_wrap_policy,
transformer_layer_cls={TransformerBlock}
)
model = FSDP(model, auto_wrap_policy=auto_wrap_policy,
mixed_precision=MixedPrecision(
param_dtype=torch.bfloat16,
reduce_dtype=torch.bfloat16,
buffer_dtype=torch.bfloat16,
))
# Optimizer
optimizer = torch.optim.AdamW(
model.parameters(),
lr=3e-4,
betas=(0.9, 0.95),
eps=1e-8,
weight_decay=0.1
)
    # Scheduler: warmup + cosine decay (warmup_steps/total_steps are
    # example values; derive them from your token budget)
    warmup_steps, total_steps = 2000, 100_000
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return max(0.1, 0.5 * (1 + math.cos(math.pi * progress)))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
return model, optimizer, scheduler
# Training step (scheduler and the global step counter are passed in explicitly)
def train_step(model, batch, optimizer, scheduler, step, grad_accum_steps):
    tokens, targets = batch
    # BF16 autocast needs no GradScaler, unlike FP16
    with torch.autocast("cuda", dtype=torch.bfloat16):
        logits, loss = model(tokens, targets)
    loss = loss / grad_accum_steps
    loss.backward()
    if step % grad_accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return loss.item() * grad_accum_steps
14.5 Phase 4: Evaluation & Iteration (Ongoing)
Evaluation Checkpoints:
Every 1000 steps:
- Validation loss (held-out data)
- Perplexity on test set
Every 5000 steps:
- Run lm-evaluation-harness on core benchmarks
- MMLU, HellaSwag, ARC, TruthfulQA
After pre-training completes:
- Full benchmark suite
- Human evaluation samples
- Red-teaming for safety issues
14.6 Phase 5: Post-Training (Month 5-6)
1. Collect SFT data:
- Buy/license instruction datasets
- Use GPT-4 to generate synthetic data
- Human annotators for quality examples
2. Fine-tune with TRL/Axolotl:
accelerate launch train_sft.py \
--model_name_or_path base_model/ \
--dataset_name sft_data \
--max_seq_length 4096 \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4
3. Collect preference data:
- Sample 2+ outputs for each prompt
- Human annotators rank outputs
- Tools: LabelStudio, Argilla, Scale AI
4. Train reward model:
python train_reward_model.py \
--model sft_model/ \
--data preference_data.json
5. RLHF/DPO:
python train_dpo.py \
--model sft_model/ \
--reward_model rm_model/ \
--beta 0.1
15. Reverse Engineering Existing LLMs
15.1 Approach & Methodology
Reverse engineering modern LLMs means studying their papers, open implementations, and behavioral analysis to understand design decisions.
15.2 Reverse Engineering GPT-4 (What We Know)
- Architecture (from papers + leaks):
- Mixture of Experts: ~8 experts, 2 active per token
- Estimated 1.8T total params, ~200B active per forward pass
- ~120 transformer layers
- Context: 128K tokens (GPT-4 Turbo)
- Training Data: ~13T tokens estimated
- RLHF: Extensive human feedback + InstructGPT methodology
- Safety: Constitutional AI-like red-teaming
- Multimodal: CLIP-style vision encoder + projection
15.3 Reverse Engineering Llama 3.1 (Open Weights)
# Inspect Llama 3.1 70B architecture
from transformers import AutoModelForCausalLM, AutoConfig
config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-70B")
print(config)
# Key findings:
# hidden_size: 8192
# intermediate_size: 28672
# num_attention_heads: 64
# num_key_value_heads: 8 → GQA (8 KV heads vs. 64 Q heads)
# num_hidden_layers: 80
# rope_theta: 500000.0
# vocab_size: 128256
# max_position_embeddings: 131072
# rms_norm_eps: 1e-05
# hidden_act: "silu"
15.4 Behavioral Reverse Engineering
Techniques:
1. Prompt probing – test specific capabilities systematically
2. Activation patching – identify which layers encode which information
   (requires white-box access or a similar open model)
3. Mechanistic interpretability:
   - Identify attention head functions (induction heads, copy heads)
   - Superposition hypothesis: polysemantic neurons
   - Sparse autoencoders to find features (Anthropic's SAE work)
4. Logit lens – project intermediate representations onto the vocabulary
5. Activation analysis – t-SNE/UMAP of hidden states
6. Probing classifiers – train linear probes on hidden states
15.5 Studying Open Source LLMs
Key open models to study (in order of insight value):
1. GPT-2 (117M) – OpenAI, fully open, educational
   git clone https://github.com/openai/gpt-2
2. LLaMA 3 (8B-405B) – Meta, open weights + tokenizer details
   Excellent reference architecture
3. Mistral 7B – reference for sliding window attention + GQA
4. Falcon (1B-180B) – Technology Innovation Institute
   Original GQA + MQA reference
5. Pythia (70M-12B) – EleutherAI, training checkpoints available
   Study training dynamics over time
6. OLMo (7B) – Allen AI, truly open (code + data + checkpoints)
   Best for studying the training process
7. MosaicML MPT – HuggingFace-native architecture
Study approach:
- Read architecture paper
- Clone training codebase
- Trace forward pass manually
- Measure parameter counts per component
- Profile memory and compute requirements
16. Building Your Own LLM Service
16.1 Service Architecture Overview
                       ┌─────────────────────┐
                       │    Load Balancer    │
                       │   (nginx/Traefik)   │
                       └──────────┬──────────┘
                                  │
             ┌────────────────────┼────────────────────┐
             │                    │                    │
    ┌────────┴────────┐  ┌────────┴────────┐  ┌────────┴────────┐
    │   API Server    │  │   API Server    │  │   API Server    │
    │    (FastAPI)    │  │    (FastAPI)    │  │    (FastAPI)    │
    └────────┬────────┘  └────────┬────────┘  └────────┬────────┘
             │                    │                    │
             └────────────────────┼────────────────────┘
                                  │
                        ┌─────────┴─────────┐
                        │  Request Router   │
                        │ (Priority Queue)  │
                        └─────────┬─────────┘
                                  │
             ┌────────────────────┼────────────────────┐
             │                    │                    │
    ┌────────┴────────┐  ┌────────┴────────┐  ┌────────┴────────┐
    │ Inference Node  │  │ Inference Node  │  │ Inference Node  │
    │   vLLM / TGI    │  │   vLLM / TGI    │  │   vLLM / TGI    │
    │    (4×H100)     │  │    (4×H100)     │  │    (4×H100)     │
    └─────────────────┘  └────────┬────────┘  └─────────────────┘
                                  │
              ┌───────────────────┴────────────────────┐
              │           Supporting Services          │
              │    Redis (cache)  |  PostgreSQL (logs) │
              │ Prometheus (metrics) |  Grafana (viz)  │
              │         MinIO (model artifacts)        │
              └────────────────────────────────────────┘
16.2 API Layer Implementation
# FastAPI server for LLM service
# (apply_chat_template, format_openai_response, format_stream_chunk are
# application helpers assumed to be defined elsewhere)
import json
from uuid import uuid4
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
app = FastAPI(title="LLM API Service")
# Initialize vLLM engine
engine_args = AsyncEngineArgs(
model="your_model_path",
tensor_parallel_size=4, # 4 GPUs
gpu_memory_utilization=0.95,
max_num_batched_tokens=32768,
max_num_seqs=256,
enable_chunked_prefill=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
class ChatRequest(BaseModel):
messages: list[dict]
max_tokens: int = 2048
temperature: float = 0.7
top_p: float = 0.9
stream: bool = False
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
# Apply chat template
prompt = apply_chat_template(request.messages)
sampling_params = SamplingParams(
max_tokens=request.max_tokens,
temperature=request.temperature,
top_p=request.top_p,
)
if request.stream:
return StreamingResponse(
stream_generator(prompt, sampling_params),
media_type="text/event-stream"
)
    # Non-streaming: generate() returns an async iterator (do not await it);
    # the last item yielded carries the finished output
    final_output = None
    async for result in engine.generate(prompt, sampling_params, request_id=str(uuid4())):
        final_output = result
    return format_openai_response(final_output)
async def stream_generator(prompt, sampling_params):
async for output in engine.generate(prompt, sampling_params, str(uuid4())):
chunk = format_stream_chunk(output)
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
16.3 Deployment with Kubernetes
# k8s deployment for LLM inference
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-inference
spec:
replicas: 3
selector:
matchLabels:
app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference    # must match the selector above
    spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- --model
- /models/llama-3-70b
- --tensor-parallel-size
- "4"
- --max-num-batched-tokens
- "32768"
- --port
- "8000"
resources:
limits:
nvidia.com/gpu: 4
requests:
memory: "200Gi"
cpu: "32"
volumeMounts:
- name: model-storage
mountPath: /models
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: model-pvc
nodeSelector:
nvidia.com/gpu.product: "H100-SXM-80GB"
16.4 Monitoring & Observability
# Prometheus metrics for LLM service
from prometheus_client import Counter, Histogram, Gauge
REQUEST_COUNT = Counter('llm_requests_total', 'Total requests', ['model', 'status'])
REQUEST_LATENCY = Histogram('llm_request_latency_seconds',
'Request latency', ['model'],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0])
TOKENS_GENERATED = Counter('llm_tokens_generated_total', 'Tokens generated', ['model'])
GPU_MEMORY_USED = Gauge('llm_gpu_memory_bytes', 'GPU memory used', ['gpu_id'])
QUEUE_SIZE = Gauge('llm_queue_size', 'Current queue depth')
16.5 Cost Estimation
Infrastructure Cost Example (7B model, 100K daily users):
Serving: 2× 8×A100 nodes (AWS p4d.24xlarge)
  Cost: ~$32/hr/node × 2 = $64/hr = $1,536/day ≈ $46K/month
Storage: 100TB (model, logs, cache)
  Cost: ~$2,300/month (S3)
Network: 10TB outbound/day
  Cost: ~$900/month
Training (one-time, 7B model on ~2T tokens):
  FLOPs ≈ 6 × 7e9 × 2e12 ≈ 8.4 × 10^22
  At 312 TFLOPS × 0.4 MFU ≈ 125 TFLOP/s per A100 → ~190K GPU-hours (~300K with restarts and overhead)
  Cost: ~$300K one-time for a quality 7B model
Break-even: ~$0.001/1K tokens at scale
17. Cutting-Edge Developments
17.1 Test-Time Compute Scaling (2024-2025)
- OpenAI o1/o3 – extended chain-of-thought reasoning
  - Models "think" for seconds to minutes before answering
  - Process Reward Models (PRMs) guide reasoning
  - MCTS/beam search over reasoning steps
  - Breakthroughs on AIME math and competition programming
- DeepSeek-R1 – open-source reasoning model
  - GRPO training (Group Relative Policy Optimization)
  - RL directly on reasoning with rule-based rewards, no learned reward model
  - Matches o1 on many benchmarks at lower cost
- Test-time compute scaling law: more inference compute → better results
17.2 Multimodal LLMs
- Architecture: vision encoder → projector → LLM
  - CLIP/SigLIP → linear/MLP projector → decoder-only LLM
- GPT-4V/GPT-4o: images, audio, text unified
- Gemini 1.5 Pro: 1M context, native multimodal
- LLaVA / LLaVA-NeXT: open multimodal models
- Qwen-VL: image/video understanding
- Video LLMs: VideoLLaMA, Video-LLaVA, Qwen2-VL
- Any-to-Any: Unified IO, CoDi, NExT-GPT
17.3 Efficient Architecture Innovations
- GQA (2023) – grouped query attention, now standard
- Sliding Window + Full Attention hybrid – the Mistral approach
- MLA (Multi-head Latent Attention) – DeepSeek-V2/V3
  - Low-rank KV compression: ~93% KV cache reduction
  - Matches MHA quality with MQA-like efficiency
- Differential Attention – Microsoft, 2024
  - Cancels attention noise with the difference of two softmaxes
- Linear Attention / RetNet / RWKV / Mamba
  - Subquadratic alternatives to standard attention
- TTT (Test-Time Training) – treats context as gradient descent
17.4 Training Innovations
- Flash Attention 3 – hardware-aware kernels for H100 FP8
- FP8 Training – native 8-bit training on H100
- Online RLHF – continuously update the RM with new data
- RLAIF – AI feedback replacing human annotation
- Constitutional AI 2.0 – multi-principle alignment
- Direct Preference Optimization variants (IPO, KTO, ORPO)
- Synthetic Data Generation – Phi series, Llama distillation
- Curriculum Learning – easy→hard data ordering
- Data Attribution – identify the most influential training examples
17.5 Inference & Serving Innovations
- Speculative Decoding – 2-4× speedup, no quality loss
- Medusa / EAGLE – parallel decoding heads
- Continuous Batching – vLLM's signature feature
- Chunked Prefill – interleave prefill and decode
- Prefix Caching – reuse the KV cache across requests
- Quantization advances: GPTQ, AWQ, AQLM, FP8 inference
- MoE routing optimization – expert parallelism
- Disaggregated prefill/decode – separate servers for each phase
17.6 Long Context & Memory
- Retrieval Augmented Generation 2.0
- Self-RAG, FLARE, Adaptive RAG
- Multi-hop reasoning over retrieved docs
- Infinite context: StreamingLLM, MemGPT, Infini-Attention
- Memory networks: Titans (2025), neural long-term memory
- Long context windows: Gemini 1.5 (1M), Claude 3 (200K), Llama 3.1 (128K)
- Persistent memory systems: vector databases + LLM
17.7 Agentic AI (2024-2025)
- Tool use / function calling – structured JSON outputs
- Code execution – a Python interpreter as a tool
- Browser agents – web navigation (Computer Use, WebAgent)
- Multi-agent systems – AutoGen, CrewAI, LangGraph
- Long-horizon planning – hierarchical task decomposition
- World models – model-based reasoning about the environment
18. Build Ideas β Beginner to Advanced
🟢 Beginner Level (Months 1-6)
Project 1: GPT from Scratch (The Classic) Beginner
Goal: Build and train a character-level GPT
Skills: PyTorch basics, attention, training loop
Dataset: tiny_shakespeare.txt (~1MB)
Model: ~10K-100K parameters
Reference: Andrej Karpathy's "nanoGPT" tutorial
Project 2: Train a Tiny Tokenizer Beginner
Goal: Implement BPE tokenizer from scratch
Skills: String processing, Python
Dataset: Text corpus of your choice
Deliverable: Custom tokenizer matching tiktoken output
Project 3: BERT Fine-Tuning for Classification Beginner
Goal: Fine-tune BERT for sentiment analysis
Skills: HuggingFace Transformers, fine-tuning
Dataset: SST-2, IMDB, or custom
Deliverable: 90%+ accuracy classifier with API
Project 4: Chatbot with LoRA Fine-Tuning Beginner
Goal: Fine-tune Llama 3.1 8B on custom instructions
Skills: PEFT, QLoRA, Axolotl
Dataset: 1K-10K instruction pairs
Hardware: 1× RTX 4090 or Colab A100
Project 5: RAG System Beginner
Goal: Build retrieval-augmented Q&A over documents
Skills: Embeddings, FAISS, LangChain
Components: PDF loader → chunker → embedder → retriever → LLM
🟡 Intermediate Level (Months 6-18)
Project 6: Train a 125M Parameter LLM Intermediate
Goal: Pre-train GPT-2 sized model on domain data
Skills: Distributed training, data pipeline, evaluation
Dataset: 10-50B tokens (domain-specific)
Hardware: 4-8× A100 GPUs
Framework: Megatron-LM or custom PyTorch FSDP
Cost: ~$5K-20K compute
Project 7: Reward Model Training Intermediate
Goal: Train a reward model for RLHF
Skills: Preference data collection, Bradley-Terry model
Dataset: 50K+ comparison pairs
Deliverable: RM that scores responses 0-10
Evaluation: Accuracy on held-out comparisons
Project 8: Multimodal LLM (Vision + Text) Intermediate
Goal: Build LLaVA-style model
Architecture: CLIP ViT-L + projection MLP + Llama 3B
Training: 2-stage (align → instruction-tune)
Dataset: LLaVA-CC3M-Pretrain-595K + LLaVA-Instruct-150K
Skills: Multimodal data, vision encoder integration
Project 9: Production Inference Service Intermediate
Goal: Deploy your fine-tuned model as a production API
Components:
- vLLM/TGI inference engine
- FastAPI with streaming support
- Redis for rate limiting + caching
- Prometheus + Grafana monitoring
- Docker Compose → Kubernetes migration
SLA: 99.9% uptime, <500ms p50 latency
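The streaming API layer can be this small; the generator below is a hypothetical stand-in for iterating over vLLM's async output stream:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class Prompt(BaseModel):
    text: str

async def generate_tokens(prompt: str):
    # Stand-in generator; in production this would iterate over the
    # inference engine's async token stream instead.
    for tok in ["Hello", ", ", "world", "!"]:
        yield tok

@app.post("/generate")
async def generate(req: Prompt):
    # Tokens are flushed to the client as they are produced.
    return StreamingResponse(generate_tokens(req.text), media_type="text/plain")
```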
Project 10: Code Generation Model Intermediate
Goal: Fine-tune or train a code-specialized LLM
Dataset: The Stack (languages you support)
Eval: HumanEval, MBPP, SWE-Bench
Features: FIM (fill-in-middle), multi-file context
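FIM training rewrites each example so the model learns to generate the middle given the surrounding code. The sentinel strings below follow StarCoder's convention; other code models use different special tokens:

```python
def to_fim_psm(prefix: str, middle: str, suffix: str) -> str:
    """Build one PSM ('prefix-suffix-middle') training example: the model
    sees prefix and suffix, then learns to emit the middle."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

# At inference the model is given everything up to <fim_middle> and must
# generate the missing span, e.g. the body of a half-written function.
```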
🔴 Advanced Level (Months 18-36+)
Project 11: 7B Parameter Pre-training from Scratch Advanced
Goal: Train a competitive open-source 7B model
Budget: $200K-500K compute (reducible with optimizations)
Data: 1-2T tokens of curated web + books + code
Architecture: Llama 3-style (GQA, RoPE, SwiGLU, RMSNorm)
Training: 3D parallelism on 64-128× H100s
Evaluation: Competitive with Llama 3 8B on MMLU, HellaSwag
Project 12: Full RLHF Pipeline Advanced
Goal: Complete SFT → RM → PPO pipeline
SFT: 500K high-quality instruction examples
RM: 100K preference comparisons, 75%+ agreement accuracy
PPO: Stable training, no mode collapse
Deliverable: RLHF-tuned model preferred over SFT by humans
Tools: OpenRLHF or custom PPO implementation
Project 13: Reasoning Model (o1-style) Advanced
Goal: Build a reasoning model with extended CoT
Approach 1: MCTS + PRM training
Approach 2: GRPO, as in DeepSeek-R1
Dataset: Math (MATH, AMC, AIME) + code problems
Metric: AIME accuracy, competition math benchmarks
Novel contribution: Improved search algorithm or reward shaping
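For Approach 2, the distinctive piece of GRPO is its critic-free advantage estimate: sample a group of responses per prompt, score them, and normalize within the group. A sketch:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages as in GRPO: no learned value network;
    each response is scored relative to its own sampling group."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)

# e.g. rewards for 8 sampled solutions to one math problem (1 = correct):
# grpo_advantages(torch.tensor([1., 0., 0., 1., 0., 0., 0., 1.]))
```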
Project 14: MoE Language Model Advanced
Goal: Build Mixtral-style MoE model
Architecture: 8 experts, top-2 routing, 7B active params
Challenge: Load balancing, expert collapse prevention
Benefit: ~47B total params, only ~12.9B active per token
Framework: Megablocks or custom CUDA kernel
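The router is the architecturally novel part. A Mixtral-style top-2 gate in a few lines (the load-balancing auxiliary losses, which are the hard part in practice, are left out):

```python
import torch
import torch.nn.functional as F

def top2_gate(x: torch.Tensor, w_router: torch.Tensor):
    """Top-2 gating. x: (tokens, d_model); w_router: (d_model, n_experts).
    Returns expert indices and renormalized mixing weights per token."""
    probs = F.softmax(x @ w_router, dim=-1)        # router distribution
    weights, experts = probs.topk(2, dim=-1)       # two experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return experts, weights  # dispatch each token to 2 experts, mix their outputs

# A Switch-Transformer-style auxiliary loss is typically added so tokens
# do not collapse onto a handful of experts.
```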
Project 15: LLM Research Contribution Advanced
Goal: Novel research contribution publishable at ACL/NeurIPS/ICLR
Ideas:
- New attention mechanism for long context
- Better data selection algorithm
- Novel PEFT method
- Interpretability finding
- New benchmark or evaluation methodology
- Alignment technique
- Efficient architecture variant
Process: Baseline → ablation → comparison → writeup → submission
19. Research Papers You Must Read
Foundational
- Attention Is All You Need (Vaswani et al., 2017) – The Transformer
- BERT (Devlin et al., 2018) – Bidirectional pre-training
- GPT-2 (Radford et al., 2019) – Language model pre-training
- GPT-3 (Brown et al., 2020) – Few-shot learners, scaling
- Scaling Laws for Neural LMs (Kaplan et al., 2020)
Architecture
- RoFormer/RoPE (Su et al., 2021) – Rotary position embedding
- ALiBi (Press et al., 2021) – Attention with linear biases
- GQA (Ainslie et al., 2023) – Grouped query attention
- FlashAttention (Dao et al., 2022) – IO-aware attention
- FlashAttention-2 (Dao, 2023)
- Mistral 7B (Jiang et al., 2023) – SWA + GQA
- Mixtral (Jiang et al., 2024) – Sparse MoE
- Mamba (Gu & Dao, 2023) – Linear-time sequence modeling
- LLaMA (Touvron et al., 2023) and LLaMA 2 & 3
Training & Optimization
- Chinchilla (Hoffmann et al., 2022) – Scaling laws revised
- PaLM (Chowdhery et al., 2022) – Large-scale language modeling
- Megatron-LM (Shoeybi et al., 2019) – Efficient large model training
- ZeRO (Rajbhandari et al., 2020) – Memory optimization
- AdamW (Loshchilov & Hutter, 2017) – Decoupled weight decay
- Lion Optimizer (Chen et al., 2023)
Alignment & RLHF
- InstructGPT (Ouyang et al., 2022) – RLHF for instruction following
- Constitutional AI (Bai et al., 2022) – Anthropic's alignment approach
- DPO (Rafailov et al., 2023) – Direct preference optimization
- RLHF (Christiano et al., 2017) – Original RLHF paper
- Self-Play Fine-Tuning (SPIN) (Chen et al., 2024)
Inference
- Speculative Decoding (Leviathan et al., 2022)
- vLLM / PagedAttention (Kwon et al., 2023)
- GPTQ (Frantar et al., 2022) – Post-training quantization
- AWQ (Lin et al., 2023) – Activation-aware quantization
- QLoRA (Dettmers et al., 2023) – Efficient fine-tuning
Reasoning & Capabilities
- Chain-of-Thought Prompting (Wei et al., 2022)
- Self-Consistency (Wang et al., 2022)
- Tree of Thoughts (Yao et al., 2023)
- ReAct (Yao et al., 2022) – Reasoning + acting
- DeepSeek-R1 (DeepSeek, 2025) – Open reasoning model
Recent (2024-2025)
- DeepSeek-V3 (2024) – Efficient large MoE
- Gemini 1.5 (2024) – 1M context
- Claude 3 Technical Report – Constitutional AI advances
- Llama 3 (Meta, 2024) – Technical report
- Titans (2025) – Neural long-term memory
20. Complete Learning Timeline
Phase 1: Foundations (Months 1-3)
Month 1: Math & Programming
Week 1-2: Linear algebra (3Blue1Brown + Gilbert Strang MIT)
Week 3-4: Calculus, probability, statistics (Khan Academy + Bishop PRML)
Month 2: ML & DL Basics
Week 1-2: Classical ML (Andrew Ng Coursera)
Week 3-4: Deep learning (fast.ai Part 1, or d2l.ai)
Month 3: NLP & Transformers
Week 1-2: NLP fundamentals, word vectors
Week 3-4: Transformer from scratch + HuggingFace ecosystem
Project: Train character-level GPT on Shakespeare
Phase 2: LLM Fundamentals (Months 4-6)
Month 4: Transformer Internals
- Read: "Attention Is All You Need", GPT-2 paper, BERT paper
- Implement: Multi-head attention, RoPE, RMSNorm from scratch
- Project: Fine-tune BERT on custom classification task
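As a taste of the Month 4 implementation work, RMSNorm is the smallest of the three components. A Llama-style version (no mean-centering, no bias):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm as used in Llama: rescale by the reciprocal root-mean-square
    of the features; unlike LayerNorm there is no mean subtraction or bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```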
Month 5: Training at Scale
- Study: Megatron-LM, DeepSpeed ZeRO, FSDP
- Implement: Distributed training with FSDP on 2-4 GPUs
- Project: Train 125M GPT on ~1B token dataset
Month 6: Fine-Tuning & Alignment
- Study: LoRA, QLoRA, SFT, DPO papers
- Implement: LoRA adapter, QLoRA training pipeline
- Project: Fine-tune Llama 3 8B on instruction dataset with QLoRA
Phase 3: Intermediate Skills (Months 7-12)
Month 7-8: Data Pipeline Engineering
- Web scraping at scale, datatrove
- Deduplication with MinHash
- Quality filtering pipeline
- Project: Build 10B token domain corpus
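A toy MinHash to make the dedup idea concrete; production pipelines use banded LSH (e.g., datasketch, or datatrove's built-in dedup stages) rather than all-pairs comparison:

```python
import hashlib

def minhash(tokens, num_perm=64):
    """Toy MinHash signature: the minimum hash per seeded hash function.
    The fraction of matching signature slots between two documents
    estimates their Jaccard similarity, so near-duplicates score high.
    `tokens` would typically be a set of word n-gram shingles."""
    shingles = set(tokens)
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_perm)
    ]

def similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```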
Month 9-10: Production Serving
- vLLM deployment, FastAPI, Docker, K8s
- Monitoring, autoscaling, caching
- Project: Deploy fine-tuned 7B model as production API
Month 11-12: Evaluation & Benchmarking
- Run lm-evaluation-harness
- Build custom eval suite
- Understand benchmarks: MMLU, HumanEval, MT-Bench
- Project: Comprehensive eval of your model vs. baselines
Phase 4: Advanced Training (Months 13-24)
Month 13-15: Pre-training from Scratch
- Architect and implement 7B parameter model
- Data pipeline: 500B-1T tokens
- 3D parallel training on H100 cluster
- Training stability, loss monitoring, recovery
Month 16-18: Full RLHF Pipeline
- Preference data collection tools
- Reward model training and evaluation
- PPO or DPO training
- Safety evaluation + red-teaming
Month 19-21: Advanced Topics
- Mixture of Experts
- Multimodal extensions (vision + language)
- Long context techniques
- Speculative decoding
Month 22-24: Research Contribution
- Novel technique or finding
- Paper writing + submission
- Open-source contribution
Phase 5: Mastery (Month 24+)
- Lead model development at company or in open source
- Publish research papers
- Build novel architectures
- Start your own AI company or project
- Contribute to frontier model development
📚 Essential Resources
Books
- "Deep Learning" β Goodfellow, Bengio, Courville (free online)
- "The Little Book of Deep Learning" β FranΓ§ois Fleuret (free PDF)
- "Dive into Deep Learning" (d2l.ai) β interactive, PyTorch/JAX/TF
- "Pattern Recognition and Machine Learning" β Bishop
- "Mathematics for Machine Learning" β Deisenroth et al. (free PDF)
- "Speech and Language Processing" β Jurafsky & Martin (free PDF)
- "The Alignment Problem" β Brian Christian
Online Courses
- fast.ai – Practical Deep Learning for Coders (FREE)
- Andrej Karpathy: Zero to Hero – YouTube (FREE) – BEST starting point
- DeepLearning.AI Specializations – Coursera
- Stanford CS224N – NLP with Deep Learning (free lectures on YouTube)
- Stanford CS336 – Language Modeling from Scratch (2024, free)
- MIT 6.S191 – Introduction to Deep Learning
Blogs & Communities
- Lilian Weng's Blog (lilianweng.github.io) – authoritative ML explainers
- Sebastian Ruder's Blog – NLP research
- Andrej Karpathy's Blog – insightful posts
- HuggingFace Blog – practical tutorials
- Anthropic Research Blog – alignment and safety
- Google AI Blog, Meta AI Blog, OpenAI Blog
- EleutherAI Discord – open-source LLM community
- r/MachineLearning, r/LocalLLaMA
- Papers with Code – benchmarks + implementations
- The Gradient – accessible research explainers
GitHub Repositories to Study
- karpathy/nanoGPT – minimal, educational GPT
- karpathy/llm.c – LLM in raw C/CUDA
- meta-llama/llama – reference Llama implementation
- vllm-project/vllm – production inference
- huggingface/transformers – universal model library
- EleutherAI/lm-evaluation-harness – benchmarking
- hiyouga/LLaMA-Factory – fine-tuning factory
- microsoft/DeepSpeed – training optimization
- NVIDIA/Megatron-LM – large-scale training
- allenai/OLMo – fully open LLM
karpathy/nanoGPTβ minimal, educational GPTkarpathy/llm.cβ LLM in raw C/CUDAmeta-llama/llamaβ reference Llama implementationvllm-project/vllmβ production inferencehuggingface/transformersβ universal model libraryEleutherAI/lm-evaluation-harnessβ benchmarkinghiyouga/LLaMA-Factoryβ fine-tuning factorymicrosoft/DeepSpeedβ training optimizationNVIDIA/Megatron-LMβ large-scale trainingallenai/OLMoβ fully open LLM